-
Notifications
You must be signed in to change notification settings - Fork 76
Update handling of mxOPAQUE_CLASS objects and Subsystem in MAT_v5.jl #205
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
9f44ce5 to
89572a7
Compare
|
With these changes, full support is added for loading classdef objects in MAT-files in both v5 and HDF5 formats. Classdef objects are returned as a The changes support different MAT-file and subsystem versions. It also supports loading all types of Some notes:
Edit: |
* MAT_subsys.jl: New file MAT_subsys with methods to set, parse and retrieve subsystem data * MAT_v5.jl: New method "read_opaque" to handle mxOPAQUE_CLASS * MAT_v5.jl: New method "read_subsystem" to handle subsystem data * MAT.jl (matread): Update to clear subsystem and object cache after load Support for loading mxOPAQUE_CLASS objects in v7.3 HDF5 format * MAT_HDF5.jl (matopen): New argument Endian indicator, Reads and parses subsystem on load * MAT_HDF5.jl (close): Update to write endian header based on system endianness * MAT_HDF5.jl (m_read::HDF5.Dataset): Update to handle MATLAB_object_decode (mxOPAQUE_CLASS) types * MAT_HDF5.jl (m_read::HDF5.Group): Update to read subsystem data and function_handles * MAT.jl (matopen): Update function calls Updated test for struct_table_datetime.mat to ensure accurate deserialization (including nested properties) in both v7 and v7.3 formats * test/read.jl: Update tests for "function_handles.mat" and "struct_table_datetime.mat"
|
Hi there, I like your PR and I hope you still want to work on it. I have my own PR-207 and I will use Is there a good way to separate normal struct arrays from MATLAB classes in the MAT.jl interface? Would the distinction come from the existence of the (I also hope an active contributor/maintainer pops up for MAT.jl to review our work. I am trying to get in contact with them via email) |
|
Hey! It does look like there will be a conflict in our PRs. I aimed to provide read-write support in this module, but I only included read support for MATLAB classes in this PR to make it reasonable to review. Coming to the typing, my opinion is that the same types should be used for both read and write, and the typing should be consistent across the different MAT-file versions. I worked on a Pythonic version of this module, and there I used a wrapper class Ultimately, it's a design choice that would benefit from a discussion surrounding user requirements. It would be good if a maintainer (@ViralBShah ?) could pitch in their thoughts. As I understand there's not much activity here, but I would be happy to help out in any way I can! |
|
I like the idea of a wrapper type, at least for read/write consistency. Having obscure key names like I wonder if it would be smart to have another wrapper type for the struct arrays? Some kind of mutable named tuple like type would be great, like: Looking at matio for inspiration, I found a struct array test, but I cannot quickly infer what Python type you are reading/writing? How do you differentiate between cell arrays of structs and struct arrays? Right now MAT.jl does not have read/write consistency for struct arrays (they get re-written as a struct with cell arrays per field name). |
|
@matthijscox is a maintainer and we can add others too. |
So my main point was to keep it consistent with For example, The only place this doesn't work is for empty structs which don't have field names. In this case, I've used a wrapper class
I do agree with this. In fact, I went through several iterations in my Python module before arriving at the current data representations. The main problem I had was ensuring separate types for simple read-write operations whilst keeping it user-friendly and scalable for the newer MATLAB classdef-based types. This lead to me using a bunch of wrapper classes that keep a balance between allowing numpy array operations and differentiability. Something of that sort can be done here if it aligns with the scope of this package, and I would be happy to help. I'm not familiar with Julia typing and will need some time, but in the meantime you can read the type conversions I used in |
src/MAT_subsys.jl
Outdated
|
|
||
| object_arr = Array{Dict{String,Any}}(undef, convert(Vector{Int}, dims)...) | ||
|
|
||
| for i = 1:length(object_arr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor issue: more Julian syntax would be for (i, oid) in enumerate(object_ids)
src/MAT_subsys.jl
Outdated
| end | ||
|
|
||
| const subsys_cache = Ref{Union{Nothing,Subsys}}(nothing) | ||
| const object_cache = Ref{Union{Nothing, Dict{UInt32, Dict{String,Any}}}}(nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code is pretty complicated, so I cannot oversee the decisions here, but I am curious why you need this stateful cache? Is it not possible to rewrite the code to just pass the subsys_cache and object_cache through each function?
I think we now we have the risk that someone tries to read multiple .mat files in multiple threads and this will break down.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's been a while so I'm not sure exactly, but I think it had to with it involving a whole bunch of function definition changes to incorporate both the caches. There's a lot of recursive calls so the caches would have to be passed to all functions even if they don't actually use it.
I would prefer a better alternative to handle thread safety, but your suggestion works as well.
|
I've begun reading the code. The classes seem to be read mostly here as dicts in MAT_subsys.jl R285-R288. I suppose we could define a struct MatlabOpaque{T}
classname::Symbol
properties::Vector{String} # or Vector{Symbol}
values::Vector{T}
endOptionally I wonder if we want to be able to expose the class "type" to the Julia type system. (These Val types do come with a small performance penalty I remember) struct MatlabOpaque{C<:Val, T}
properties::Vector{String} # or Vector{Symbol}
values::Vector{T}
end
# outer constructor
function MatlabOpaque(classname::Symbol, properties::Vector{String}, values::Vector{T}) where T
return MatlabOpaque{Val{classname}, T}(properties, values)
endNow you can identify a vector of same class (class/struct array?) from a vector different classes (cell array?) MatlabOpaque{Val{:table}, Any}[] # class array of Matlab tables
MatlabOpaque[] # class (cell) array of different classes
Any[] # cell array of any kind of (matlab) typeThis also means we can write automatic converters: Base.convert(::Type{DataFrame}, MatlabOpaque{Val{:table}}) = ...or better yet add the Tables.jl interface: Tables.istable(::Type{MatlabOpaque{Val{:table}}) = true
Tables.istable(::Type{MatlabOpaque}) = falseOfcourse instead of the Val type parameter, we could just convert immediately to different known types during the reading. Then we have to define the multiple MAT.jl internal types we need, or immediately convert to the most sensible Julia type. struct MatlabTable
...
end
Tables.istable(::Type{MatlabTable}) = true
struct MatlabDuration <: Dates.AbstractTime # or use Dates.Millisecond type
...
endI would propose to first implement the |
|
If I'm understanding correctly, we'd have Some points to clarify/get clarified:
The rest sounds good. Doesn't sound like it diverges too much from the current state of the PR. To summarize the main edits:
|
Indeed. Note I now created a
You mean that you want to index by name Basically we can define methods like: function Base.getindex(object::MatlabOpaque, prop::AbstractString)
idx = findfirst(isequal(prop), object.properties)
return object.values[idx]
end
Pff, I will have to check. You also make me worried about these empty sized struct arrays for my PR. Is this the way to construct them?
Yeah, well
I very new to MAT.jl myself I admit. I saw the It's possible we're going to have merge conflicts with my PR, since I've been refactoring a bit there. So if you have any time to review my PR that would be great. |
To quickly follow-up on your question: The bad newsCurrently (MAT v0.10) we cannot. An empty >> s00
s00 =
struct with fields:
c: {}
b: {}
a: {}Good newsIn my new struct array PR I can write them! (reading goes wrong, I'll fix that): In Julia: empty_sarr = MAT.MatlabStructArray(["a", "b", "c"], [Matrix{Any}(undef, 0, 0), Matrix{Any}(undef, 0, 0), Matrix{Any}(undef, 0, 0)])
matwrite("test.mat", Dict("s00" => empty_sarr))In MATLAB: >> load('test.mat')
>> s00
s00 =
0×0 empty struct array with fields:
a
b
c |
Haha yeah, sorry. It's an edge case from a user perspective, but MATLAB uses these zero-element dimensions as placeholders, and getting them wrong results in unexpected errors when loading in MATLAB. Ideally if the
yeah, sounds good.
I believe I did notice some differences between the HDF5 and v5 versions. The HDF5 file isn't as extensive as the v5 file, and I honestly think it needs a rewrite to allow consistency between versions. But that's for another time. |
|
I now support empty struct arrays in my PR #207. Create one easily via I also created the I think that's all we need to continue with your PR. I will probably merge soon and then if you want I can help with this PR to solve any merge conflicts for you. |
|
Thanks! I'm a bit swamped till the end of the month and will most probably take a look at this after, will try to squeeze something in as soon as I can |
| merge!(prop_dict, get_properties(subsys, oid)) | ||
| # cache it | ||
| obj = MatlabOpaque(prop_dict, classname) | ||
| subsys.object_cache[oid] = obj |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The object should be cached before merge!. Else get_properties can result in an infinite recursion for handle class objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see... Should we add a unit test for these handle class objects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that'll be a good idea, I'll add tests for handle class objects and object arrays.
src/MAT_subsys.jl
Outdated
|
|
||
| function get_default_properties(subsys::Subsystem, class_id::UInt32) | ||
| prop_vals_class = subsys.prop_vals_defaults[class_id+1, 1] | ||
| # is it always a MatlabStructArray? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think prop_vals_class will be a MatlabStructArray. It should always be a scalar 1x1 struct. Did you come across an example that goes against this?
Also, we can get rid of the copy statement. Instead, something like this to identify nested objects should work.
for (k, v) in prop_vals_class
prop_vals_class[k] = update_nested_props!(v, subsys)
end
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had errors for 0x0 MatlabStructArray so this case definitely happens in the test data somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay makes sense, MATLAB writes 0x0 structs if there are no default values. But we'll still need to loop through the values looking for nested objects. I'll add a test for nested objects in saved and default values as well.
src/MAT_types.jl
Outdated
| return DateTime[] | ||
| end | ||
| if !isempty(obj["tz"]) | ||
| @warn "no timezone conversion yet for datetime objects. timezone ignored" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be a good idea to display the timezone the datetime object was saved with in the warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. Note I didn't want to add TimeZones.jl yet because I remember that it's a massive dependency, adding lots of precompile or artifact download time
src/MAT_types.jl
Outdated
| if num_strings==1 | ||
| return first(strings) | ||
| else | ||
| return reshape(strings, shape...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like the job is failing here. Maybe explicitly cast to a tuple?
|
Somehow I increased SnoopCompile invalidations a bit, this might mean package loading is slightly slower, but it's all due to dependencies (mostly HDF5.jl). Not sure I know how to fix this, so maybe we'll just take the hit. On master branch, according to CI: On my current local commit: Seems I managed to suppress a few HDF5 invalidations (probably by defining a few more explicit typed |
|
Slightly confused right now. Locally on Julia 1.12.0 I have 883 invalidations. CI (on 1.12.1) says 1474 invalidations Switched to Julia 1.12.1 with a fresh environment. Got 1040 invalidations |
|
I added tests for handle class objects and nested properties. I also added support for loading dynamic properties (with tests). Made some tiny changes to accommodate - object arrays are now I think we should clean up this PR and focus on supporting more classes/types in a different one. I will add write support for |
|
I agree. Do you see any more cleanup needed? Because I'd love to release MAT.jl v0.11 after merging this PR. |
I think this will fail if one the objects in the array gets automatically converted, e.g. to a DateTime. I will add such a test case tomorrow. |
I'll just add some comments to try and improve readability of The invalidations stuff is pretty new to me. If I understand correctly, the number of invalidations in the PR must be less than that in master? If its not critical, can we bypass it and tackle it in a separate issue?
Ah so this is a unique quirk of MATLAB, and shouldn't be an issue. For user-defined classes, object arrays contain a separate object instance in each element of the array (unless they are handle class objects which may be references). Also, all objects in an array must be of the same class. But for MATLAB datatypes like |
Yeah I can merge with the "invalidations CI step" failing. Then I suppose we have a new baseline.
If this is true, that means we can actually skip the
We could add a formatting style to MAT.jl, none has been chosen so far. I typically opt for bluestyle. We can at least reformat our new files in this PR.
The MCOS system seems obscure and the code obvious reflects that a bit. I'm just happy it works at all. We can refactor and comment on the obscure parts, but I'm okay with what we have. (Though what the on earth are some of these fields like |
|
I took the liberty to refactor a bit and format the style for MAT_types.jl and MAT_subsys.jl |
|
Sounds good. Everything with subsystem is reverse engineered so we can only guess as to what they do. Some fields are unknown still, but we still need them to write back to a MAT-file. Interestingly, MATLAB does not perform any sanity checks when loading subsystem data, so you can easily crash it by simply going out of bounds in array indexing. For safety, we could add sanity checks here if required. I wanted one thing clarified before we move forward. I noticed that both MATLAB char arrays and string arrays are returned as Julia strings. It would be better if we could somehow separate this behaviour. Char arrays are written as a bunch of ASCII or UTF-8 encodings, but string arrays use the MCOS opaque class format. It's not an issue in What are your thoughts on this? Finally, I'm listing a bunch of issues we can close after merging this PR. |
I see. We can obviously postpone until the writing PR. We could use the same approach as you do, but then we need to search for another Julia string type other than the base UTF-8 |
|
Seems to me this PR is ready. Let me know if you are happy with the code or want to do further refactoring. Once you give the go, I'll merge and register v0.11. |
|
Lets go for it. We can open new issues for the other stuff, they won't affect base functionality.
I think they're a mix of ASCII or UTF-8, even though MATLAB documentation says they use UTF-16. It might be optimized during save based on content? MATLAB strings though are definitely using UTF-16. |
Spinoff from #23 to read the undocumented datatype
mxOPAQUE_CLASS. Part of a series of changes to read and write these types across formatsv5andv7.3.Some context :
uint8array. However, this looks like another MAT-file that needs to be converted and read into. This needs to parsed before reading any variables in file.Edit:
Added a new file "MAT_subsys.jl" which contains methods for caching, parsing, and retrieving subsystem data to be assigned to an object. With this it should successfully load classdef objects. Additional context regarding how subsystem data is organized below:
uint32integers even though its written in asuint8-- Block 1 is a version indicator and some offset values
-- Block 2 is a list of class and property names as uint8 integers (null terminated)
-- Block 3 is a list of class IDs
-- Blocks 4 and 6 contain some metadata about how linking property names and property values
-- Block 5 is a list of object ID metadata
-- Block 7 is a list of dynamic properties attached to the object
-- Blocks 8 and 9 are unknown